Abstract

Little research has examined individual linguistic features that influence English language
learners’ (ELLs’) test performance. Furthermore, research has yet to explore the relationship
between the science strand of test items and the types of linguistic features the items include.
Utilizing Differential Item Functioning (DIF), this study examines ELL performance on 162 Grade 5
large-scale science multiple-choice test items and its relationship to two discourse features, as
well as the distribution of these features across three science strands. We also interviewed 52
ELLs to examine their interaction with these two features. Results indicate that these two
features were most frequent on items under the Life Science strand. Additionally, both of these
features were significantly correlated with DIF disfavoring ELLs, indicating that the presence of
these features potentially hinders ELLs’ abilities to understand test items containing these
features. Interviews confirmed that these two features in combination interfered with ELLs’
abilities to make sense of the items, which often resulted in students answering incorrectly, even
when they demonstrated knowledge of the content. Because the features are most frequent on
Life Science items, ELLs’ content knowledge on these topics may be severely underestimated
for this strand.
Do linguistic features of science test items prevent
English Language Learners from demonstrating their knowledge? 

Accurately assessing English language learners’ (ELLs’) knowledge in science is a complex
and pressing issue. While many existing assessments have been shown to be inaccurate measures
of ELLs’ content knowledge (Abedi et al., 2005; Authors, 2012; Sato et al., 2010), new and
increasingly linguistically complex assessments are being developed. To better assess ELLs’
science knowledge, many argue that current testing practices need to (a) incorporate theories that
acknowledge that ELLs draw from two language systems (Valdes & Figueroa, 1994), (b)
consider sociolinguistic aspects of interactions between students and test items (Solano-Flores,
2006; 2008), and (c) include ELLs in the test development process (Abedi & Hefri, 2004).

Testing accommodations are the main tool used to improve the accessibility of science tests
for ELLs, and can include the use of bilingual dictionaries, extra time, translated tests, and
linguistic simplification (Duran, 2008; Rivera et al., 2006). While results on the effectiveness of
testing accommodations have been mixed (Abedi & Hefri, 2004; Kieffer et al., 2009;
Pennock-Roman & Rivera, 2011; Sireci, Li, & Scarpati, 2003), some show improved ELL performance
with reduced linguistic complexity of test items (Abedi, Courtney, & Leon, 2003; Sato et al.,
2010).

However, linguistic complexity of test items is not consistently defined from one study to the
next, and specific linguistic features that contribute to linguistic complexity are frequently
neither defined nor operationalized (Authors, in preparation). As a result, it is difficult to
replicate findings from individual studies and to accumulate knowledge about how specific
linguistic features of test items interfere with the performance of ELL students. Furthermore,
much of the research on the effects of linguistic complexity of test items on ELLs’ performance
has focused on mathematics test items, although the challenges of the language of science test
items for ELLs are likely to be even greater (Penfield & Lee, 2010). In addition, studies that have
investigated the linguistic complexity of science test items have not explored the relationship
between the science strand of test items and the types of linguistic features the items include. We
conclude that more research is needed to understand the sources of problematic linguistic
complexity for ELL students on science test items, and the relationship between linguistic
complexity and science strand. Consequently, this study examines ELL performance on science
test items from the Grade 5 Massachusetts Comprehensive Assessment System (MCAS) and its
relationship to specific linguistic features of these test items, as well as the distribution of these
features across three science strands: Earth and Space Science (ESS), Life Science (LS), and
Physical Science (PS).

Conceptual Framework 

Understanding a test item is crucial for accurately solving it (Leighton & Gokiert, 2008).
Discourse features are essential for constructing this understanding, as they affect students’
comprehension of the item as a whole. We conceptualize the process of item comprehension as
the integration of text and background knowledge to create mental representations that capture
the details of what is read (e.g., who, when, why) in the form of a Situation Model (Zwaan &
Radvansky, 1998). In assessments, the student must also construct a Problem Model, which
consists of what the student needs to know from the text and what needs to be done with that
information to correctly answer the item (Nathan, Kintsch, & Young, 1992).

We hypothesize that some discourse features may prevent ELLs from creating the intended
Situation and Problem Models for test items. In previous work, we found one such feature,
Forced Comparison, to interfere with the construction of the intended Situation and Problem
Models for students from diverse backgrounds (Authors, 2008; 2012). The Forced Comparison
feature occurred in items asking the student to compare all the answer choices and select the
option with an extreme value, such as the best or most likely choice. Subsequent analysis found
that this feature often co-occurred with another discourse feature, Reference Back, which
required students to return to a previous sentence to find information necessary to answer the
question (e.g., How would this change affect the plant population?). To better understand the
role of these discourse features in ELLs’ performance on science test items across strands, we
ask:

• What is the relationship between science strand and ELL performance?

• What are the frequencies and distributions of these two discourse features across science
strands in these test items?

• What is the relationship between these discourse features and ELLs’ performance? 

• What do student interviews reveal about how ELLs interact with these features? 

Methods 

This report is part of a 4-year study, currently in progress, investigating the effects of specific
linguistic features of test items on ELLs’ performance on large-scale standardized
multiple-choice items. These features include aspects at the word level, sentence level, and item
level (see Kachchaf et al., submitted). For the purposes of this paper, we focus on two item-level
features, discussed below.

Quantitative Data Sources 

Students. Student performance for the correlation analysis was calculated for Grade 5
students from three classifications of English proficiency: non-ELLs, Limited English Proficient
(LEP), and Formerly Limited English Proficient (FLEP), as determined by the state. For each
year, this statewide study included (a) 52,694 - 56,991 non-ELL students, (b) 2,645 - 3,804 LEP
students, and (c) 1,761 - 2,466 FLEP students.

Items. We correlated student performance on 162 publicly released science multiple-choice
items from a state-mandated science exam for the years 2004-2010. These items covered three
science strands: Earth and Space Science, Physical Science, and Life Science.

Linguistic Features of Test Items. A comprehensive literature synthesis filtered existing
studies that investigated the role of linguistic complexity in ELL test performance (see Noble,
Kachchaf, & Rosebery, in preparation). Across the 11 studies identified in the literature
synthesis, over 60 linguistic features were identified as potentially influencing ELL performance.
From these 60 features, the project team selected 13 features at the word, sentence, and item
level to analyze (for details on features not discussed in this study, see Kachchaf et al.,
submitted). However, the majority of item-level features found in previous research were
difficult to replicate due to a lack of information on how they were operationalized.
Therefore, we included an item-level feature from our own previous research (the Forced
Comparison feature) and identified a new item-level feature that arose during preliminary
analysis of items. Both of these features are discussed below.

Forced Comparison. This feature was defined as an item that typically (a) used Which of the
following, (b) named a category of what was sought: “Which of the following drawings...”, (c)
asked students for an end of scale value (e.g., best shows or most likely result), and (d) had a verb
or noun associated with the end of scale value (e.g., best shows or most likely result). A question
statement containing the Forced Comparison is: Which of the following drawings best shows the
life cycle of a berry bush? Items with the presence of the Forced Comparison received a score of
1.

Reference Back. During the pilot study, preliminary analysis of test items with the Forced
Comparison feature uncovered another feature that concerned the relationship between
sentences in the item. We defined the Reference Back feature as a question sentence that
required the student to return to the text of a previous sentence in the item to identify information
necessary to answer the question. In some cases, this feature was instantiated in an explicit
anaphoric reference. For example, if an item’s question statement asked, Where did this rock
most likely form?, students would need to refer back to a previous sentence to find out what this
rock was. In other cases, the Reference Back feature occurred when the question sentence had no
explicit reference to prior information but nonetheless required students to refer back to previous
parts of the question to construct an understanding of what the question asked. This feature
was dichotomously scored.

Qualitative Data Sources 

Student interviews. In addition to calculating performance of students statewide, we
gathered detailed qualitative data on how students interacted with items containing these two
linguistic features. We interviewed 52 ELLs from 3 districts about 32 different test items with
and without the Forced Comparison and Reference Back features. For each interview, students
answered six multiple-choice items in either (a) the original form containing these features, or
(b) a modified form created by the project team that removed these features. After students
selected an answer for each item, bilingual interviewers asked the students (1) how they solved
the items, (2) whether they understood specific linguistic features of the items, and (3) whether
they knew the science content being assessed.
Data Analysis

Quantitative Coding and Analysis. Two coders with experience in teaching and educational
research were trained to code all 162 items independently for the presence or absence of features,
including the Forced Comparison and Reference Back. For the purposes of coding, the Forced
Comparison was identified in any item containing an end of scale value (e.g., best or most likely).
To code the Reference Back feature, coders were given items with only the question statement and
the answer options. If the coders deemed it possible to answer the item after seeing only the
question statement, the item was coded as not having the Reference Back feature. If the coders
decided it was not possible to answer the item when reading only the question statement, the
item was coded as containing the Reference Back feature. Coders were given items in three
rounds, randomly determined, to code independently. After coding these items independently,
the coders met to discuss any discrepancies and to arrive at consensus.
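
The Forced Comparison coding rule above lends itself to a simple first-pass screen. The following Python sketch is illustrative only and is not the coders' procedure: the function name and the pattern are assumptions, and the pattern covers only the two end of scale expressions named as examples in the text (best and most likely). Human judgment would still be needed, since a word such as best can appear without signaling a comparison across answer choices. The Reference Back feature depended on coder judgment and is not covered here.

```python
import re

# Assumed, non-exhaustive pattern: the text names only "best" and
# "most likely" as examples of end of scale values.
END_OF_SCALE = re.compile(r"\b(best|most likely)\b", re.IGNORECASE)


def has_forced_comparison(question_statement: str) -> int:
    """Return 1 if the question statement contains an end of scale value, else 0."""
    return 1 if END_OF_SCALE.search(question_statement) else 0


# Question statements quoted in this paper, plus one made-up counterexample.
print(has_forced_comparison(
    "Which of the following drawings best shows the life cycle of a berry bush?"))
print(has_forced_comparison(
    "How did the earthworm most likely respond to these conditions?"))
print(has_forced_comparison("What is the boiling point of water?"))  # hypothetical item
```

A screen like this could only pre-sort items for the trained coders; it does not replace the consensus step described above.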

We calculated Differential Item Functioning (DIF) using the Standardization method (Dorans
& Kulick, 1986) to determine which test items showed differences in the probabilities of
answering correctly for LEPs, FLEPs, and non-ELLs who were at the same ability levels. DIF
values calculated using a second method, HLM-LR, were significantly correlated with DIF
values calculated using the Standardization method (.873, p < .001). Spearman’s rank correlation
measured the association between items’ DIF values and the presence of the two item-level
features: Forced Comparison and Reference Back.
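
As an illustration of the Standardization method, the index can be computed as a weighted average, across matched ability levels, of the difference in proportion correct between the focal group (e.g., LEP students) and the reference group (non-ELLs), with weights given by the number of focal-group students at each level. The Python sketch below rests on assumptions not stated in this paper: the record layout, the function name, the use of a single matching score, and the sign convention (negative values disfavor the focal group). It is not the code used in this study.

```python
from collections import defaultdict


def std_p_dif(records):
    """Standardized p-difference for one item.

    records: iterable of (group, matching_score, correct), where group is
    "focal" or "reference" and correct is 0 or 1. This layout is hypothetical.
    """
    # For each matching-score level, track [n_students, n_correct] per group.
    levels = defaultdict(lambda: {"focal": [0, 0], "reference": [0, 0]})
    for group, score, correct in records:
        cell = levels[score][group]
        cell[0] += 1
        cell[1] += int(correct)

    weighted_diff = 0.0
    total_weight = 0
    for cell in levels.values():
        n_focal, correct_focal = cell["focal"]
        n_ref, correct_ref = cell["reference"]
        if n_focal == 0 or n_ref == 0:
            continue  # no comparison is possible at this score level
        p_focal = correct_focal / n_focal
        p_ref = correct_ref / n_ref
        weighted_diff += n_focal * (p_focal - p_ref)  # weight by focal-group count
        total_weight += n_focal

    if total_weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_diff / total_weight


# Made-up toy data with two score levels, for illustration only.
toy = (
    [("focal", 1, 1), ("focal", 1, 0)]
    + [("reference", 1, 1)] * 3 + [("reference", 1, 0)]
    + [("focal", 2, 1)] * 2
    + [("reference", 2, 1)] * 2
)
print(std_p_dif(toy))
```

Weighting by focal-group counts means the index reflects the ability levels where the focal group is actually concentrated.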

Qualitative Coding and Analysis. Two coders independently coded student interviews and
discussed discrepancies to arrive at consensus. Drawing from student responses, the coders
identified whether the student (1) selected the correct response, (2) understood specific linguistic
features of the item, and (3) demonstrated knowledge of the construct the item assessed.

Results 

To answer the first research question, What is the relationship between science strand and
ELL performance?, we investigated the frequency of non-negligible levels of DIF favoring
non-ELLs over ELLs across the three science strands included in the STE MCAS. Although the total
corpus of 162 items was evenly distributed across each science strand, of the 62 test items with
non-negligible DIF, 30 were Life Science (LS) items, while 16 were Earth and Space Science
(ESS) items, and 16 were Physical Science (PS) items, indicating that there is a pattern of greater
ELL difficulties with LS test items on the STE MCAS. To investigate the reasons for this
pattern, we pursued our second research question: What is the frequency and distribution of these
discourse features in Grade 5 science multiple-choice test items? We calculated the average
feature score for each of these features, shown in Table 1 below.

Table 1. Distribution of Linguistic Features

                                  Strand
Feature              LS         ESS        PS         Total
                     (n=58)     (n=51)     (n=53)     (n=162)

Forced Comparison    0.66       0.55       0.36       0.52
Reference Back       0.34       0.24       0.19       0.26


Table 1 shows the average feature score for all items in the last column, regardless of the
science strand that they fell under. The last column shows that about half of the items contained
the Forced Comparison feature and about one fourth contained the Reference Back feature.

The distribution of these features differed across science strands. Table 1 shows a general
trend of LS items having the highest feature score for each discourse feature. PS items had the
lowest feature score for each of these features, while ESS items had a mean greater than PS
items but less than LS items. Some of these differences were quite large. In fact, for LS items,
the mean feature score for the Forced Comparison feature only and the mean feature score for the
Forced Comparison and Reference Back combined were almost twice as great as scores for the PS
items. It appears that item writers tended to utilize certain discourse features when assessing
topics in LS as compared to topics under the other science strands.
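
The strand comparisons above can be checked with simple arithmetic. The Python sketch below uses only figures reported in this paper (strand sizes and mean feature scores from Table 1, and the counts of items with non-negligible DIF); the variable names are illustrative, and treating the DIF counts as shares of each strand's item total is an added step that the paper does not report.

```python
# Figures taken from Table 1 and from the Results text.
strand_sizes = {"LS": 58, "ESS": 51, "PS": 53}
dif_counts = {"LS": 30, "ESS": 16, "PS": 16}  # items with non-negligible DIF
forced_comparison = {"LS": 0.66, "ESS": 0.55, "PS": 0.36}
reference_back = {"LS": 0.34, "ESS": 0.24, "PS": 0.19}

# Share of each strand's items that showed non-negligible DIF.
dif_share = {s: dif_counts[s] / strand_sizes[s] for s in strand_sizes}

# How much more frequent each feature is on LS items than on PS items.
fc_ratio = forced_comparison["LS"] / forced_comparison["PS"]
rb_ratio = reference_back["LS"] / reference_back["PS"]

for strand, share in dif_share.items():
    print(f"{strand}: {share:.2f} of items with non-negligible DIF")
print(f"Forced Comparison, LS vs PS: {fc_ratio:.2f}")
print(f"Reference Back, LS vs PS: {rb_ratio:.2f}")
```

Under these figures, about half of LS items showed non-negligible DIF compared with under a third of ESS and PS items, and each feature was roughly 1.8 times as frequent on LS items as on PS items.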

To answer the third research question, How do these features relate to ELL performance?,
we correlated these features with the items’ DIF values, as shown in Table 2. The Forced
Comparison feature was significantly and positively correlated with DIF disfavoring LEP and
FLEP students. The Reference Back feature was significantly and positively correlated with DIF
disfavoring LEP students only.

Table 2. Correlations between Linguistic Features of Test Items and Item DIF Values

Features             LEP        FLEP

Forced Comparison    .194*      .192*
Reference Back       .192*      .101
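
To illustrate the kind of computation behind these correlations, the following Python sketch computes Spearman's rank correlation between a dichotomous feature score and item DIF values, as the Pearson correlation of tie-averaged ranks. The function names and the four data points are hypothetical and were made up for illustration; they are not values from this study, and the sketch does not compute significance levels.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical items: feature present (1) or absent (0), and made-up DIF values.
feature = [0, 0, 1, 1]
dif = [0.01, 0.02, 0.05, 0.08]
print(round(spearman_rho(feature, dif), 3))
```

Because the feature scores are dichotomous, most ranks are tied, which is why the sketch averages ranks within ties rather than using the shortcut formula based on squared rank differences.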


To answer the fourth research question, How do ELLs interact with these features?, we
analyzed student interviews. Because they were significantly correlated with DIF disfavoring
ELLs, we specifically focused only on items with both the Forced Comparison and Reference
Back features. Here, we summarize one case. Yolanda, a native Spanish-speaking student
classified as LEP, incorrectly answered the Earthworm item (shown below) that contained the
Forced Comparison and Reference Back features. She chose C. by staying where it was placed.


An earthworm was placed on top of a
thick layer of moist topsoil in a pan. The pan
was placed in a room with the lights on.

How did the earthworm most likely respond
to these conditions?

A. by burrowing under the soil

B. by crawling around in the pan

C. by staying where it was placed

D. by trying to crawl out of the pan

Figure 1. Earthworm item: Forced Comparison in italics, Reference Back underlined. 

Although Yolanda knew all of the words except for two terms (most likely and respond), she
stated that she “didn’t really get” this item. Focusing on the phrase was placed on top of a thick
layer of moist topsoil, Yolanda said C. was correct, “because it says it was on top of a moist
topsoil in a pan. If it was on top, it would just be there staying steady.” Yolanda thought the
phrase most likely respond to these conditions asked her to choose the best answer for what the
earthworm was doing. It was not clear to her that respond to these conditions required her to go
back to the first two sentences of the item to determine the earthworm’s reaction to (a) being on
top of moist topsoil, and (b) being placed in a room with the lights on. Rather than referring
back, she thought these conditions referred forward to the answer choices. Her interpretation of
most likely, the extreme value asked for by the Forced Comparison feature, only intensified her
difficulty, as she thought it meant the best answer choice. A key phrase, the lights on, was buried
in the middle of the item, where Yolanda stated she did not notice it. Nevertheless, she knew a
lot about earthworms. Later, the item was asked in a simplified form without these features.
Yolanda immediately and correctly chose A. by burrowing under the soil, stating “earthworms
do not like light, so it would go under the soil.” It appears that, while she knew most of the
words, the discourse features Forced Comparison and Reference Back prevented her from
demonstrating her knowledge.

Significance 

This study provides insight into discourse features of science test items that can be
systematically identified and analyzed. Our results showed that the features were frequent, but
that the distribution of these features differed across the three science strands. Each feature was
most frequent on LS items. This is problematic because the Forced Comparison and Reference
Back features were significantly correlated with DIF disfavoring LEP and FLEP students,
indicating that they potentially hinder ELLs’ abilities to understand test items containing these
features. Interviews confirmed that these two features in combination interfered with ELLs’
abilities to make sense of the items, which often resulted in students answering incorrectly, even
when they demonstrated knowledge of the content. Therefore, tests may not be obtaining
accurate measures of ELL knowledge on items containing these features. Because the features
are most frequent on LS items, ELLs’ content knowledge on these topics may be severely
underestimated by these tests. These results call for further investigation into the ways that ELLs
interact with these features as well as the reasons for the exceptional incidence of these features
in the LS strand. We hypothesize that item writers may have utilized these features to
assess complex LS standards in a multiple-choice format when an open-response format may
have better assessed the content.